Abstract

We introduce a new criterion to determine the order of an autoregressive model 
fitted to time series data. It has the benefits of the two well-known model selection 
techniques, the Akaike information criterion and the Bayesian information criterion. 

When the data is generated from a finite order autoregression, the Bayesian information
criterion is known to be consistent, and so is the new criterion. When the true
order is infinity or suitably high with respect to the sample size, the Akaike information
criterion is known to be efficient in the sense that its prediction performance is
asymptotically equivalent to the best offered by the candidate models; in this case, the 
new criterion behaves in a similar manner. Different from the two classical criteria, the 
proposed criterion adaptively achieves either consistency or efficiency depending on the 
underlying true model. In practice where the observed time series is given without any 
prior information about the model specification, the proposed order selection criterion 
is more flexible and robust compared with classical approaches. Numerical results are 
presented demonstrating the adaptivity of the proposed technique when applied to 
various datasets. 

Keywords: Adaptivity; AIC; Autoregressive model; BIC; Consistency; Efficiency; Model
selection; Parametricness index 




1. INTRODUCTION 


In practical autoregressive model fitting, the order of the model is generally
unknown. Many order selection methods have been proposed, following different
philosophies. Anderson’s multiple decision procedure (Anderson 1962) sequentially tests 
when the partial autocorrelations of the time series become zero. The final prediction error 
criterion proposed by Akaike (1969) aims to minimize the one-step prediction error when the 
estimates are applied to another independently generated dataset. Bhansali & Downham 
(1977) generalized the final prediction error criterion by replacing 2 with a parameter $\alpha$
in its formula, and proved that the asymptotic probability of choosing the correct order
increases as $\alpha$ increases. The well-known Akaike Information Criterion, AIC (Akaike 1998),
was derived by minimizing the Kullback-Leibler divergence between the true distribution and 
the estimate of a candidate model. Some variants of AIC, for example the modified Akaike 
information criterion that replaces the constant 2 by a different positive number, have also 
been considered (Broersen 2000). Nevertheless, Akaike (1979) argued in a Bayesian setting 
that the original AIC is more reasonable than its variants in a practical situation. Hurvich & 
Tsai (1989) proposed the corrected AIC for the case where the sample size is small. Another 
popular method is the Bayesian information criterion, BIC, proposed by Schwarz (1978), which
aims at selecting a model that maximizes the posterior model probability. Hannan & Quinn
(1979) proposed a criterion, HQ, that replaces the $\log N$ term in BIC by $c \log\log N$ ($c > 1$),
where N is the sample size, and they showed that it is the smallest penalty term that 
guarantees strong consistency of the selected order. The focused information criterion is 
another remarkable approach that takes into account the specific purpose of the statistical 
analysis, by estimating the risk quantity of interest for each candidate model (Claeskens & 
Hjort 2003; Claeskens, Croux & Van Kerckhoven 2007). Other methods for autoregressive 
order selection include the criterion autoregressive transfer function method (Parzen 1974), 
the predictive least-squares principle (Rissanen 1986; Hemerly & Davis 1989), the combined 
information criterion (Broersen 2000); see de Gooijer, Abraham, Gould & Robinson (1985) 
and Shao (1997) for more references. Despite the rich literature on autoregressive models,
the most common order selection criteria are AIC and BIC. 

In this paper, the specified model class for fitting is the set of autoregressions with orders
$L = 1, \dots, L_{\max}$ for some prescribed natural number $L_{\max}$. In relation to the true data
generating process, the model class is referred to as well-specified (or parametric) if the data
is generated from a finite order autoregression and the true order is no larger than $L_{\max}$,
and mis-specified (or nonparametric) otherwise. It is well known that BIC is consistent in
order selection in the well-specified setting. In other words, the probability of choosing the 
true order tends to one as the sample size tends to infinity. The Akaike information criterion 
is not consistent and has a fixed overfitting probability when the sample size tends to infinity 
(Shibata 1976). However, AIC is shown to be efficient in the mis-specified setting, while BIC 
is not (Shibata 1980). Here we call an order selection procedure (asymptotically) efficient if 
its prediction performance (in terms of the squared difference between the prediction and its 
target conditional mean) is asymptotically equivalent to the best offered by the candidate 
autoregressive models. A rigorous definition of efficiency is given in Section 4.2. In other 
words, AIC typically produces less modeling error than BIC when the data is not generated 
from a finite order autoregressive process. Furthermore, asymptotic efficiency of AIC for 
order selection in terms of the same-realization predictions for infinite order autoregressive 
or integrated autoregressive processes has also been well established (Ing & Wei 2005; Ing, 
Sin & Yu 2012). 

In real applications, one usually does not know whether the model class is well-specified. 
The task of adaptively achieving the better performance of AIC and BIC is theoretically intriguing
and practically useful. There have been several efforts in this direction. Yang
(2005) considered the possibility of sharing the strengths of AIC and BIC in the regression
context. It has been shown under mild assumptions that any consistent model selection criterion
behaves suboptimally for estimating the regression function in terms of the minimax
rate of convergence. In other words, the conflict between AIC and BIC in terms of achieving
model selection consistency and minimax-rate optimality in estimating the regression function
cannot be resolved. But this does not indicate that there exists no criterion achieving
the pointwise asymptotic efficiency in both well-specified and mis-specified scenarios, because
the minimaxity (uniformity over the linear coefficients) is intrinsically different from
the (pointwise) efficiency. In the remarkable work by Ing (2007), a hybrid selection procedure
combining AIC and a BIC-like criterion was proposed. Loosely speaking, if a BIC-like
criterion selects the same model at sample sizes $N^{\tau}$ ($0 < \tau < 1$) and $N$, then with high probability
(for large $N$) the model class is well-specified and the true model has been converged
to, and thus a BIC-like criterion is used; otherwise AIC is used. Under some conditions,
the hybrid criterion was proved to achieve the pointwise asymptotic efficiency in both well-specified
and mis-specified scenarios. In estimating regression functions with independent
observations, Yang (2007) proposed a similar approach to adaptively achieve asymptotic efficiency
for both parametric and nonparametric situations, by examining whether BIC selects
the same model again and again at different sample sizes (instead of only two sample sizes
used by Ing (2007)). Liu & Yang (2011) proposed a method to adaptively choose between
AIC and BIC based on a measure called parametricness index. In the context of sequential
Bayesian model averaging, Erven, Grünwald & De Rooij (2012) and van der Pas & Grünwald
(2014) used a switching distribution to encourage early switch to a better model and offered 
interesting theoretical understanding on its simultaneous properties. Cross-validation has 
also been proposed as a general solution to choosing between AIC and BIC. It was shown 
by Zhang & Yang (2015) that, with a suitably chosen data splitting ratio, the composite 
criterion asymptotically behaves like the better one of AIC and BIC for both the AIC and 
BIC territories. 

In this paper, we introduce a new model selection criterion which is referred to as the 
bridge criterion (BC) for autoregressive models. The bridge criterion is able to address the
following two issues. First, given real time series data, an analyst is usually unaware
of whether the model class is well-specified or not. Second, even if the model class is known
to be correct, the order (dimension) is not known, so that any prescribed finite candidate
set runs the risk of missing the true model. We show that BC achieves both consistency
when the model class is well-specified and asymptotic efficiency when the model class is
mis-specified under some sensible conditions. Recall that the penalty terms of AIC and BIC
are proportional to $L$ for an autoregressive model of order $L$. In contrast, a key element of BC
is the expression $1 + 2^{-1} + \cdots + L^{-1}$ employed in its penalty term. As we shall see, it is
the harmonic number that “bridges” the features of AIC and BIC. Another key element is
to let $L_{\max}$ grow with sample size. We emphasize that for the well-specified case, once the
true order is selected with probability close to one, the resulting predictive performance is
also asymptotically optimal/efficient. From this angle, the criterion achieves the asymptotic
efficiency for both the well-specified and the mis-specified cases.

The outline of this paper is given below. In Section 2, we formulate the problem considered
in this paper and briefly introduce the background and how the new criterion was
heuristically derived. In Section 3, we propose the bridge criterion and give an intuitive
interpretation of it. We establish the consistency and the asymptotic efficiency property
in Section 4. Numerical results are given in Section 5 comparing the performance of our
approach and other techniques. In Section 4.3, we propose a two-step strategy to adaptively
choose the candidate size $L_{\max}$, in order to further relax the conditions required by the theorems
established in previous sections. To that purpose, we also extend the expression of
the bridge criterion. Finally, we conclude with a discussion in Section 6.

2. BACKGROUND 

2.1 Notation 

We use $o_p(1)$ and $O_p(1)$ to denote any random variable that converges in probability to zero,
and that is stochastically bounded, respectively. We write $h_N = \Theta(g_N)$ if $c < h_N/g_N < 1/c$
for some positive constant $c$ for all sufficiently large $N$, and $h_N = O(g_N)$ if $|h_N| < c g_N$
for some positive constant $c$ for all sufficiently large $N$. If $\lim_{N\to\infty} f_N/g_N = 0$, we write
$f = o_N(g)$, or for brevity, $f = o(g)$. Let $\lfloor x \rfloor$ denote the largest integer less than or equal to
$x$. Let $\mathcal{N}(\mu, \sigma^2)$, $B(a, b)$, $\chi_k^2$ respectively denote the normal distribution with density function
$f(x) = \exp\{-(x - \mu)^2/(2\sigma^2)\}/(\sqrt{2\pi}\sigma)$, the Beta distribution with density function $f(x) =
x^{a-1}(1 - x)^{b-1}/B(a, b)$, where $B(\cdot, \cdot)$ is the beta function, and the chi-square distribution
with $k$ degrees of freedom.

2.2 Problem formulation 

Given observations $\{x_n : n = 1, \dots, N_0\}$, we consider the following autoregressive model of
order $L$ ($L \in \mathbb{N}$)

$$x_n + \psi_{L,1} x_{n-1} + \cdots + \psi_{L,L} x_{n-L} = \varepsilon_n, \tag{1}$$

where $\psi_{L,\ell} \in \mathbb{R}$ ($\ell = 1, \dots, L$), $\psi_{L,L} \neq 0$, the roots of the polynomial $z^L + \sum_{\ell=1}^{L} \psi_{L,\ell} z^{L-\ell}$
have modulus less than 1, and the $\varepsilon_n$'s are independent and identically distributed according
to $\mathcal{N}(0, \sigma^2)$. The autoregressive model is referred to as the AR($L$) model, and $\psi_L = [\psi_{L,1}, \dots, \psi_{L,L}]^T$
is referred to as the stable autoregressive filter. Let $L_0$ denote the true order, which is
considered to be finite for now. In other words, the data is generated in the way described by
(1) with $L = L_0$. When $L_0$ is unknown, we assume that $\{1, \dots, L_{\max}\}$ is the candidate set of
orders. Let $N = N_0 - L_{\max}$. The sample autocovariance vector and matrix are respectively

$\hat\gamma_L = [\hat\gamma_{1,0}, \dots, \hat\gamma_{L,0}]^T$, $\hat\Gamma_L = [\hat\gamma_{i,j}]_{i,j=1}^{L}$, where $\hat\gamma_{i,j} = \frac{1}{N} \sum_{n=L_{\max}+1}^{N_0} x_{n-i} x_{n-j}$ ($0 \le i, j \le L_{\max}$).
The filter of the autoregressive model of order $L$ can be estimated by

$$\hat\psi_L = -\hat\Gamma_L^{-1} \hat\gamma_L, \tag{2}$$

which yields consistent estimates (Box, Jenkins & Reinsel 2011, Appendix 7.5). The one-step
prediction error is $\hat e_L = \sum_{n=L_{\max}+1}^{N_0} (x_n + \hat\psi_{L,1} x_{n-1} + \cdots + \hat\psi_{L,L} x_{n-L})^2 / N$. For convenience, we
define $\hat e_0 = \hat\gamma_{0,0}$. The error of the AR($L$) model can be calculated by


e L = e 0 (3) 

Let $\gamma_{i-j} = E\{x_{n-i} x_{n-j}\}$ ($i, j \in \mathbb{Z}$) be the autocovariances and $\psi_L = [\psi_{L,1}, \dots, \psi_{L,L}]^T$ be the
best linear predictor of order $L$. In other words, $\psi_L$ ($L \ge 1$) is the minimizer in

$$e_L = \min_{\psi^*_{L,1}, \dots, \psi^*_{L,L} \in \mathbb{R}} E\{(x_n + \psi^*_{L,1} x_{n-1} + \cdots + \psi^*_{L,L} x_{n-L})^2\}, \tag{4}$$

where the expectation is taken with respect to the stationary process $\{x_n\}$. In addition, we
define $e_0 = \gamma_0$. The values of $\psi_L$ and $e_L$ can be calculated from a set of equations similar to
(2)-(3), by removing the hats (^) from all parameters.

Given an observed time series, the problem is how to identify the unknown order of the
autoregressive model fitted to the data. The Akaike information criterion and the Bayesian
information criterion for autoregressive order selection select the $L$ ($1 \le L \le L_{\max}$) that
respectively minimizes the quantities $\mathrm{AIC}(N, L) = \log \hat e_L + 2L/N$ and $\mathrm{BIC}(N, L) = \log \hat e_L +
L \log(N)/N$. In the following two subsections, we introduce the motivation and perspective
that naturally led to the bridge criterion. The formal expression of BC and its performance
in asymptotic regions are established in Sections 3 and 4.
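
As a rough illustration of the definitions in this subsection, the following Python sketch computes the empirical errors $\hat e_0, \dots, \hat e_{L_{\max}}$ by least squares over the common sample range $n = L_{\max}+1, \dots, N_0$ and then minimizes AIC or BIC. The function names (`simulate_ar`, `ar_errors`, `select_order`), the simulated AR(2) coefficients, and the burn-in length are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def simulate_ar(psi, n0, rng, burn_in=500):
    """Simulate x_n + psi[0] x_{n-1} + ... + psi[L-1] x_{n-L} = eps_n, eps_n ~ N(0, 1)."""
    psi = np.asarray(psi, dtype=float)
    order = psi.size
    x = np.zeros(n0 + burn_in + order)
    eps = rng.standard_normal(x.size)
    for n in range(order, x.size):
        # x[n - order:n][::-1] is [x_{n-1}, ..., x_{n-L}]
        x[n] = eps[n] - psi @ x[n - order:n][::-1]
    return x[-n0:]  # drop the burn-in so the series is close to stationary

def ar_errors(x, l_max):
    """Return [e_0, e_1, ..., e_{l_max}], the mean squared one-step residuals."""
    x = np.asarray(x, dtype=float)
    n0 = x.size
    y = x[l_max:]  # x_n for n = l_max + 1, ..., N_0
    # Column k - 1 holds the lagged values x_{n-k} on the same sample range.
    lags = np.column_stack([x[l_max - k:n0 - k] for k in range(1, l_max + 1)])
    errors = [np.mean(y ** 2)]  # e_0 is the sample variance term gamma_{0,0}
    for order in range(1, l_max + 1):
        coef, *_ = np.linalg.lstsq(lags[:, :order], y, rcond=None)
        resid = y - lags[:, :order] @ coef  # the fitted filter is psi_hat = -coef
        errors.append(np.mean(resid ** 2))
    return np.array(errors)

def select_order(errors, n, penalty_per_order):
    """Minimize log(e_L) + penalty_per_order * L / n over L = 1, ..., l_max."""
    orders = np.arange(1, errors.size)
    scores = np.log(errors[1:]) + penalty_per_order * orders / n
    return int(orders[np.argmin(scores)])

rng = np.random.default_rng(0)
x = simulate_ar([-0.5, 0.3], n0=2000, rng=rng)  # hypothetical stable AR(2)
l_max = 8
errors = ar_errors(x, l_max)
n = x.size - l_max
order_aic = select_order(errors, n, 2.0)        # AIC: penalty 2L/N
order_bic = select_order(errors, n, np.log(n))  # BIC: penalty L log(N)/N
print(order_aic, order_bic)
```

Because both criteria share the same goodness-of-fit term and BIC has the larger slope once $\log N > 2$, the order chosen by BIC in this sketch can never exceed the one chosen by AIC.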

2.3 Motivation 

Distinct from AIC or BIC, the new criterion was initially derived from some perspectives 
unique to autoregressions. Briefly speaking, it was initially motivated by postulating that 
nature randomly draws the coefficients of true autoregressions from a non-informative uni¬ 
form distribution and by fixing the type I error in a sequence of hypothesis tests on the order. 
Suppose that we generate a time series to simulate an AR(L 0 ) process using ( 1 ). Clearly, 
$\hat e_1 > \cdots > \hat e_{L_0-1} > \hat e_{L_0} > \hat e_{L_0+1} > \cdots > \hat e_{L_{\max}}$. Because of (3) and the consistency of
generally is large for L < L 0 and is much smaller for L > L 0 . If we plot against L 
for $L = 1, \dots, L_{\max}$, the curve is usually decreasing for $L < L_0$ and becomes almost flat for
L > L 0 . Intuitively, the order L may be selected such that becomes ‘less significant” 

than its predecessors for L > L. We define the empirical and theoretical gain of goodness of 
fit using AR(L) over AR(L — 1), respectively, as 

$$\hat g_L = \log\left(\frac{\hat e_{L-1}}{\hat e_L}\right), \qquad g_L = \log\left(\frac{e_{L-1}}{e_L}\right). \tag{5}$$

Suppose that the data is generated by a stable filter $\psi_{L_0}$ of order $L_0$. For any positive
integer $L$ that is greater than $L_0$ and does not depend on $N$, it was shown by Anderson
(1971, Theorems 5.6.2 and 5.6.3) that $\sqrt{N}[\hat\psi_{L_0+1,L_0+1}, \dots, \hat\psi_{L,L}]^T$ has a limiting joint normal
distribution $\mathcal{N}(0, I)$ as $N$ tends to infinity, where $I$ denotes the identity matrix. In addition,
the random variables $N \hat g_L$ ($L = L_0 + 1, \dots, L_{\max}$) are asymptotically independent and
distributed according to $\chi_1^2$, where $L_{\max} > L_0$ is a constant that does not depend on $N$
(Shibata 1976). Next, we revisit AIC and BIC by associating them with a sequence of
hypothesis tests. The purpose of the argument below is to motivate our new criterion.

Test: We choose a fixed number $0 < q < 1$ as the significance level (or the type I error),
and a threshold $s$ such that $q = \mathrm{pr}(W > s)$, where $W \sim \chi_1^2$. Consider the hypothesis test

$$H_0 : L_0 = L - 1 \qquad H_1 : L_0 \ge L. \tag{6}$$

If $N \hat g_L > s$ (or equivalently $s/N - \hat g_L < 0$), we reject $H_0$ and replace $L - 1$ by $L$, for
$L = 2, 3, \dots$ until $L = L_{\max}$ or $H_0$ is not rejected. One limitation of this hypothesis test
technique is that it may produce extreme values (Akaike 1970). A straightforward alternative
solution would be to select the $L$ such that the aggregation of $s/N - \hat g_1, \dots, s/N - \hat g_L$ is
minimized, i.e., to select the global minimum:

$$\hat L = \arg\min_{1 \le L \le L_{\max}} \sum_{k=1}^{L} \left(\frac{s}{N} - \hat g_k\right), \qquad \sum_{k=1}^{L} \left(\frac{s}{N} - \hat g_k\right) = \log \hat e_L + \frac{sL}{N} - \log \hat e_0, \tag{7}$$

the objective function of which can be regarded as the goodness of fit plus the penalty
of the model complexity. The penalty term is a sum of the scaled thresholds $s/N$ and $-\log \hat e_0$. The term
$-\log \hat e_0$ does not depend on $L$, so it has no effect on the produced result and is negligible.
The Akaike information criterion has a penalty term $2L/N$; it therefore corresponds to the
above hypothesis tests with $q = 0.1573$. The Bayesian information criterion has a penalty
term $L \log(N)/N$. It corresponds to the hypothesis tests with varying $q$. As an illustration,
the significance levels $q$ of BIC under different sample sizes are tabulated in Table 1.


N    100      500      1000     2000     10000
q    0.0319   0.0127   0.0086   0.0058   0.0024

Table 1: Significance level q of the Bayesian information criterion at different sample sizes
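
The correspondence between penalties and significance levels can be checked numerically. The sketch below, which uses only the Python standard library, evaluates $q = \mathrm{pr}(W > s)$ for $W \sim \chi_1^2$ through the identity $\mathrm{pr}(W > s) = \mathrm{erfc}(\sqrt{s/2})$, with $s = 2$ for AIC and $s = \log N$ for BIC. The helper name `chi2_1_tail` is an illustrative choice.

```python
import math

def chi2_1_tail(s):
    """pr(W > s) for W following the chi-square distribution with 1 degree of freedom."""
    # W = Z^2 with Z standard normal, so pr(W > s) = pr(|Z| > sqrt(s)) = erfc(sqrt(s / 2)).
    return math.erfc(math.sqrt(s / 2.0))

q_aic = chi2_1_tail(2.0)  # AIC threshold s = 2
q_bic = {n: chi2_1_tail(math.log(n)) for n in (100, 500, 1000, 2000, 10000)}

print(round(q_aic, 4))
for n, q in q_bic.items():
    print(n, round(q, 4))
```

The printed values can be compared against $q = 0.1573$ for AIC and the entries of Table 1 for BIC.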

To motivate our new criterion, suppose that nature generates the data from an AR($L_0$)
process, which is in turn randomly generated from the uniform distribution $\mathcal{U}_{L_0}$. Here, $\mathcal{U}_{L_0}$
is defined over the space of all the stable AR filters whose roots have modulus no larger than
$r$ ($0 < r \le 1$):

$$S_L(r) = \Big\{\psi_L : z^L + \sum_{\ell=1}^{L} \psi_{L,\ell} z^{L-\ell} = \prod_{\ell=1}^{L} (z - a_\ell),\ \psi_{L,\ell} \in \mathbb{R},\ |a_\ell| < r,\ \ell = 1, \dots, L\Big\}. \tag{8}$$

Under this data generating procedure, $g_L$ is a random variable with distribution described
by the following theorem. For the sake of continuity, we postpone a detailed discussion on
$\mathcal{U}_{L_0}$ to the Supplementary Material.

Theorem 1 Suppose that $\psi_{L_0}$ is uniformly distributed in $S_{L_0}(1)$. Then, $\psi_{1,1}, \dots, \psi_{L_0,L_0}$
are independently distributed according to $(\psi_{L,L} + 1)/2 \sim B(\lfloor L/2 + 1 \rfloor, \lfloor (L + 1)/2 \rfloor)$ ($L =
1, \dots, L_0$). Furthermore, $L \psi_{L,L}^2$ and $L g_L$ converge in distribution to $\chi_1^2$ as $L$ tends to infinity.
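
The limiting statement of Theorem 1 can be examined by simulation. The following sketch draws $\psi_{L,L}$ from the scaled Beta law in the theorem for a single large order and compares $L\psi_{L,L}^2$ and $L g_L$ with the $\chi_1^2$ distribution. It assumes the relation $g_L = -\log(1 - \psi_{L,L}^2)$, which follows from $e_{L-1}/e_L = 1/(1 - \psi_{L,L}^2)$; the order 200, the sample size, the seed, and the function name `sample_psi_ll` are arbitrary illustrative choices.

```python
import numpy as np

def sample_psi_ll(order, size, rng):
    """Draw psi_{L,L} with (psi_{L,L} + 1)/2 ~ Beta(floor(L/2 + 1), floor((L + 1)/2))."""
    a = order // 2 + 1    # floor(L/2 + 1)
    b = (order + 1) // 2  # floor((L + 1)/2)
    return 2.0 * rng.beta(a, b, size=size) - 1.0

rng = np.random.default_rng(1)
order = 200
psi = sample_psi_ll(order, 200_000, rng)

scaled_psi = order * psi ** 2               # L * psi_{L,L}^2
scaled_gain = -order * np.log1p(-psi ** 2)  # L * g_L under the assumed relation

# A chi-square(1) variable has mean 1 and pr(W > 2) of roughly 0.157.
print(scaled_psi.mean(), scaled_gain.mean())
print(np.mean(scaled_psi > 2.0), np.mean(scaled_gain > 2.0))
```

For moderate orders the agreement is only approximate, since the theorem is a statement about $L$ tending to infinity.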

Similarly, we postulate hypothesis tests in the opposite direction (for a given $L_{\max}$):

$$H_0 : L_0 = L \qquad H_1 : L_0 \le L - 1. \tag{9}$$






Under the null hypothesis, $g_L \neq 0$ almost surely, and we approximate the distribution of
$\hat g_L$ by that of $g_L$. We choose a fixed number $0 < p < 1$ as the significance level, and the
associated thresholds $h_L$ at order $L$ such that $p = \mathrm{pr}(g_L < h_L)$, or equivalently

$$h_L = F_{g_L}^{-1}(p), \tag{10}$$

where $F_{g_L}^{-1}(\cdot)$ denotes the inverse function of the cumulative distribution function of $g_L$. If
$\hat g_L < h_L$ (or equivalently $\hat g_L - h_L < 0$), we reject $H_0$ and replace $L$ by $L - 1$, for $L =
L_{\max}, L_{\max} - 1, \dots$ until $L = 2$ or $H_0$ is not rejected. Likewise, the $L$ that minimizes the
following objective function can be chosen as the optimal order

$$\hat L = \arg\min_{1 \le L \le L_{\max}} \sum_{k=L+1}^{L_{\max}} (\hat g_k - h_k), \qquad \sum_{k=L+1}^{L_{\max}} (\hat g_k - h_k) = \log \hat e_L + \sum_{k=1}^{L} h_k + c, \tag{11}$$

where $c = -(\log \hat e_{L_{\max}} + \sum_{k=1}^{L_{\max}} h_k)$ does not depend on $L$. The next subsection introduces
the proposed criterion motivated by (11).


2.4 Proposed order selection criterion 

From now on, we allow the largest candidate order to grow with the sample size $N$, and
use notation $L_{\max}^{(N)}$ instead of $L_{\max}$ to emphasize this dependency. Define $N = N_0 - L_{\max}^{(N)}$.
Building on the idea of (11), we adopt the penalty term $\sum_{k=1}^{L} h_k(p)$, where $h_k(p)$ is defined
in (10), and $p$ is further determined by

$$h_{L_{\max}^{(N)}}(p) = \frac{2}{N}. \tag{12}$$

Theorem 1 implies that $h_k(p) \approx F_{\chi_1^2}^{-1}(p)/k$ for large $k$, where $F_{\chi_1^2}^{-1}(\cdot)$ denotes the inverse
function of the cumulative distribution function of $\chi_1^2$. From (12) we have $F_{\chi_1^2}^{-1}(p) \approx 2 L_{\max}^{(N)}/N$,
and thus $h_k(p) \approx 2 L_{\max}^{(N)}/(N k)$. We therefore propose the following bridge criterion: select
the $L \in \{1, \dots, L_{\max}^{(N)}\}$ that minimizes $\log \hat e_L + (2 L_{\max}^{(N)}/N) \sum_{k=1}^{L} 1/k$.
We have seen that given a fixed type I error, the threshold for hypothesis test (6) is a
constant, while the threshold for (9) decreases in $L$, leading to the $1/k$ term. Intuitively
speaking, the uniform distribution on $S_L(r)$ concentrates more around the boundary of the
space, and the loss of underfitting, $e_{L-1}/e_L = 1/(1 - \psi_{L,L}^2)$, becomes more negligible, as $L$
increases. To some extent, this observation suggests an interesting idea that the penalization
for different models is not necessarily linear in model dimension; one may start with a BIC-type
heavy penalty, but alleviate it more and more to an AIC-type light penalty as the
candidate model is larger, offering the possibility of changing/reinforcing one’s belief in the
model specification.


3. BRIDGE CRITERION 
Recall that the estimated order $\hat L$ by the bridge criterion is

$$\hat L = \arg\min_{1 \le L \le L_{\max}^{(N)}} \mathrm{BC}(N, L), \qquad \mathrm{BC}(N, L) = \log \hat e_L + \frac{2 L_{\max}^{(N)}}{N} \sum_{k=1}^{L} \frac{1}{k}, \tag{13}$$

where $L_{\max}^{(N)}$ is the largest candidate order. $L_{\max}^{(N)}$ must be selected such that $\lim_{N\to\infty} L_{\max}^{(N)} =
\infty$, and its rate of growth will be studied in Section 4. It is well known that $\sum_{k=1}^{L} 1/k =
\log L + c_E + o_L(1)$ for large $L$, where $c_E$ is the Euler-Mascheroni constant. Fig. 1 illustrates
the penalty curves for different $N$ and $L_{\max}^{(N)} = \lfloor \log N \rfloor$. Without loss of generality, we can
shift the curves to be at the same position at $L = 1$.
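
To make (13) concrete, here is a minimal Python sketch of the bridge criterion, assuming the empirical errors $\hat e_0, \dots, \hat e_{L_{\max}^{(N)}}$ have already been computed. The function names and the numeric error sequence are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def bridge_penalty(l_max, n):
    """J_BC(L) = (2 * l_max / n) * (1 + 1/2 + ... + 1/L) for L = 1, ..., l_max."""
    k = np.arange(1, l_max + 1)
    return (2.0 * l_max / n) * np.cumsum(1.0 / k)

def select_order_bc(errors, n):
    """Minimize log(e_L) + J_BC(L); errors = [e_0, e_1, ..., e_{l_max}]."""
    errors = np.asarray(errors, dtype=float)
    l_max = errors.size - 1
    scores = np.log(errors[1:]) + bridge_penalty(l_max, n)
    return int(np.argmin(scores)) + 1

# Hypothetical errors: large gains up to order 2, nearly flat afterwards.
n = 1000
errors = [2.0, 1.0, 0.5, 0.4998, 0.4996, 0.4994, 0.4992]
l_max = len(errors) - 1  # 6, which equals floor(log(1000))

penalty = bridge_penalty(l_max, n)
increments = np.diff(np.concatenate([[0.0], penalty]))  # per-step slopes 2 * l_max / (n * k)
selected = select_order_bc(errors, n)
print(selected)
print(increments)
```

The per-step slopes decrease like $1/k$ and reach the AIC slope $2/N$ exactly at $k = L_{\max}^{(N)}$, which is the sense in which the harmonic penalty starts heavy and ends light.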

Fig. 2 illustrates the penalty curves for the bridge criterion, the Akaike information criterion,
the Bayesian information criterion, and the Hannan and Quinn criterion, respectively
denoted by

$$J_{\mathrm{BC}}(L) = \frac{2 L_{\max}^{(N)}}{N} \sum_{k=1}^{L} \frac{1}{k}, \quad J_{\mathrm{AIC}}(L) = \frac{2}{N} L, \quad J_{\mathrm{BIC}}(L) = \frac{\log N}{N} L, \quad J_{\mathrm{HQ}}(L) = \frac{c \log\log N}{N} L,$$

where $c$ is chosen to be 1.1, $L = 1, \dots, L_{\max}^{(N)} = \lfloor \log N \rfloor$, and $N = 1000$.

Figure 1: A graph showing the penalty term for sample size $10^3$ (dot-dash), $10^4$ (dashes),
and $10^5$ (solid).

Any of the above penalty curves can be written in the form of $\sum_{k=1}^{L} t_k$, and only the slopes $t_k$ ($k = 1, \dots, L_{\max}$)
matter to the performance of order selection. For example, suppose that $L_2$ is selected instead
of $L_1$ ($L_2 > L_1$) by some criterion. This implies that the gain of goodness of fit $\log \hat e_{L_1} - \log \hat e_{L_2}$
is greater than the sum of slopes $\sum_{k=L_1+1}^{L_2} t_k$. Thus, we have shifted the curves of the latter
three criteria to be tangent to the log-like curve of the bridge criterion in order to highlight
their differences and connections. Here, two curves are referred to as tangent to each other if
they intersect at one and only one point, the tangent point. The tangent points (marked by
circles) of $J_{\mathrm{AIC}}$, $J_{\mathrm{HQ}}$ and $J_{\mathrm{BIC}}$ are respectively 6, 2 and 1. Take the curve $J_{\mathrm{HQ}}$ as an example.
The meaning of the tangent point is that BC penalizes more than HQ for $k < 2$ and less
for $k > 2$.

Given a sample size $N$, the tangent point between the $J_{\mathrm{BC}}$ and $J_{\mathrm{HQ}}$ curves is at $T_{\mathrm{BC:HQ}} =
2 L_{\max}^{(N)}/(c \log\log N)$. As an example, we choose $L_{\max}^{(N)} = \lfloor \log N \rfloor$. If the true order $L_0$ is finite,
$T_{\mathrm{BC:HQ}}$ will be larger than $L_0$ for all sufficiently large $N$. In other words, there will be an
infinitely large region as $N$ tends to infinity, namely $1 \le L \le T_{\mathrm{BC:HQ}}$, into which $L_0$ falls
and where BC penalizes more than HQ. As a result, asymptotically the bridge criterion does
not overfit. On the other hand, the bridge criterion will not underfit because the largest
penalty that prevents selecting $L + 1$ versus $L$ is $2 L_{\max}^{(N)}/N$, which will be less than any
fixed positive number defined in (5) for all sufficiently large $N$. The bridge criterion is
therefore consistent.

Figure 2: A graph showing the penalty curves of the bridge criterion (solid) together with
the Akaike information criterion (dashes), the Hannan and Quinn criterion (dot-dash), the
Bayesian information criterion (small dashes), and the tangent points (circled) for $N = 1000$

The inequality $(2 L_{\max}^{(N)}/N)/k \ge 2/N$ for any $1 \le k \le L_{\max}^{(N)}$ guarantees that BC penalizes
more than AIC so that it does not cause much overfitting even in the case of small N or 
large L 0 . Since BC penalizes less for larger orders and finally becomes similar to AIC, it 
is able to share the asymptotic optimality of AIC under suitable conditions. To further 
illustrate why the bridge criterion is expected to work well in general, we make the following 
intuitive argument about the model selection procedure. As we shall see, the bent curve of 
BC well connects BIC (or HQ) and AIC so that a good balance between the underfitting 
and overfitting risks is achieved. The rigorous theory will be established in Section 4. 

Intuitive argument: 

To gain further intuition, we consider an insect that is climbing a slope that is determined
by a particular penalty curve $J(L)$ from the starting point $L = 1$ to the maximal possible
end $L = L_{\max}^{(N)}$ (Fig. 3). Fig. 3(a) illustrates $J_{\mathrm{AIC}}(L)$ (black small dash) and $J_{\mathrm{HQ}}(L)$ (blue
dash). We only drew $J_{\mathrm{HQ}}(L)$ for brevity, as there is no essential difference between the two
strongly consistent criteria HQ and BIC. 

The climbing scheme and the goal: At each step $L$, the insect moves to step $L + 1$ if
its gain is larger than its loss, and it will not move any more once it stops. The gain refers
to the increased goodness of fit to the data (which is $\hat g_{L+1}$ in our autoregressive model), the
loss refers to the penalty of increased model complexity (which is $J(L + 1) - J(L)$), and the
last step where the insect stops is denoted by $\hat L$. The goal is to design a proper slope such
that the insect stops at a “desired destination” that will be elaborated on below.

The tangent points of two slopes: A slope can be written as $\sum_{k=1}^{L} t_k$. The performance
of the insect is determined by each increment $t_k$, and is not affected if the slope is shifted
by any constant that does not depend on $L$. We thus shift the curves $J_{\mathrm{AIC}}(L)$ and $J_{\mathrm{HQ}}(L)$
to be tangent to the log-like curve of $J_{\mathrm{BC}}(L)$. By our design of $J_{\mathrm{BC}}(L)$, the tangent points
between the $J_{\mathrm{BC}}(L)$ and $J_{\mathrm{AIC}}(L)$, $J_{\mathrm{HQ}}(L)$ curves are respectively at steps $T_{\mathrm{BC:AIC}} = L_{\max}^{(N)}$, $T_{\mathrm{BC:HQ}} =
2 L_{\max}^{(N)}/(c \log\log N)$. Before step $T_{\mathrm{BC:HQ}}$, the insect on the BC slope suffers more loss than on the HQ
slope in each move, while it is the other way around after step $T_{\mathrm{BC:HQ}}$.

Figure 3: (a) Curve $J_{\mathrm{AIC}}$ (blue dash) and $J_{\mathrm{HQ}}$ (black small dash), (b) the joint plot of $J_{\mathrm{BC}}$ (red
thick line) and $J_{\mathrm{AIC}}$, $J_{\mathrm{HQ}}$, by shifting the latter two to be tangent to $J_{\mathrm{BC}}$ at tangent points
$T_{\mathrm{BC:AIC}}$, $T_{\mathrm{BC:HQ}}$ (circled), in which $T_{\mathrm{BC:AIC}} < L_0$, and (c) the evolution of plot (b) to the scenario
$T_{\mathrm{BC:HQ}} > L_0$ as $N$ increases

The well-specified case: Now we categorize two distinct scenarios: where the desired
destination is within finitely many steps, and where the desired destination is beyond finitely
many steps. In the former case, there is a clear target step $L_0$. A good slope should be
designed such that the insect stops at step $L_0$. It is already known in this case that the HQ
slope is good while the AIC slope is not. In fact, it can be illustrated by Fig. 3(a), in which
the gain after $L_0$ is $O_p(1)/N$, smaller than $\Theta(\log\log N)/N$ (which is usually guaranteed by
the law of the iterated logarithm) while larger than $\Theta(1)/N$ with a positive probability for
sufficiently large $N$. How about BC? It is worth mentioning that our argument for the insect
is implicitly built upon $N$, and the concept of consistency is about large $N$ asymptotics.
Suppose that $N$ keeps increasing; the aforementioned tangent step $T_{\mathrm{BC:HQ}}$ will be not only
larger than $L_0$ but also diverging to infinity given that $\log\log N = o(L_{\max}^{(N)})$. In other words,
there is the “blackhole” region $[0, T_{\mathrm{BC:HQ}}]$ (Fig. 3(b) and (c)), in which the BC slope is steeper
than the HQ slope, and which grows to be infinitely large. It results in two consequences: First,
the insect will find it more and more difficult to escape from the region because the increased
loss from moving each step needs to be compensated by its gain. Take the autoregressive
models as an example. After moving each step the gain is approximately an independent $\chi_1^2/N$ variable,
the expectation of which is less than the loss $2/N$; so the probability of the cumulated sum of
gains being larger than that of loss decreases to zero rapidly as the number of steps increases.
Second, once the insect is trapped in the blackhole, it encounters more difficulty to move
forward on a BC slope than on an HQ one. Since on the HQ slope the insect will not move
beyond step $L_0$ (due to the strong consistency of HQ), on a BC slope it will not, either.
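
The trapping argument can be probed with a small Monte Carlo experiment. The sketch below idealizes the gains beyond the true order as independent $\chi_1^2/N$ variables, as in the paragraph above, and estimates how often the cumulated gains ever exceed the cumulated losses on the AIC slope and on the BC slope. The values $N = 1000$, $L_{\max}^{(N)} = 6$, $L_0 = 2$, the number of trials, and the function name `overfit_rate` are illustrative assumptions; the decision rule here is the global-minimum version rather than the sequential insect.

```python
import numpy as np

def overfit_rate(increments, n, trials=20_000, seed=0):
    """Estimate pr(some order beyond L_0 beats L_0) for the given per-step losses."""
    increments = np.asarray(increments, dtype=float)
    rng = np.random.default_rng(seed)
    # Idealized gains of each extra step: independent chi-square(1) / n.
    gains = rng.chisquare(1, size=(trials, increments.size)) / n
    net = np.cumsum(gains - increments, axis=1)  # cumulated gain minus cumulated loss
    return float(np.mean(net.max(axis=1) > 0.0))

n, l_max, l0 = 1000, 6, 2
k = np.arange(l0 + 1, l_max + 1)  # steps beyond the true order

rate_aic = overfit_rate(np.full(k.size, 2.0 / n), n)  # AIC loss per step: 2/N
rate_bc = overfit_rate(2.0 * l_max / (n * k), n)      # BC loss per step: 2 L_max / (N k)
print(rate_aic, rate_bc)
```

Since both calls reuse the same seed and every BC increment is at least $2/N$, the estimated BC rate cannot exceed the AIC rate in this experiment.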

On the other hand, the insect will not stop before step $L_0$. This is because of two facts:
First, the largest penalty preventing it from moving forward is $J_{\mathrm{BC}}(1) = o(1)$; Second, the gain
of the insect moving from step $L$ to $L + 1$ when $L < L_0$ is usually at least $\Theta(1) + o_p(1)$
(which is true when $\psi_{L+1,L+1} \neq 0$ in autoregressive models). Therefore, the insect stops at
step $L_0$ on a BC slope.

The mis-specified case: The fact that $T_{\mathrm{BC:AIC}} = L_{\max}^{(N)}$ guarantees that the BC slope is always
steeper than the AIC slope so that the insect does not move too far. Because the BC slope is
concave, the insect moves more and more easily for larger steps. In the case where
the appropriate destination tends to infinity, the insect will soon move to the tail part of
the slope. As one can see from Fig. 3(c), in the tail part the slope is designed to be similar
to AIC (and it becomes exactly AIC at the end step $L = L_{\max}^{(N)}$), so it is possible to share the
asymptotic optimality of AIC.

In summary, the bent curve of the BC well connects AIC and HQ so that a good balance 
between the underfitting and overfitting risks can be achieved. We emphasize that the 
above argument does not match exactly to the rigorous proof, since the decision making of 
the insect is carried out sequentially, while the aforementioned criteria select L via global 
optimum. Nevertheless, the argument for the insect does shed some light on why BC is 
likely to perform in the way we desire: to automatically behave like a consistent one when
the underlying model is well-specified, and an efficient one otherwise, alleviating the risk 
caused by an analyst’s initial prejudice. Besides this, the above argument does not assume 
any concrete probabilistic model, and thus it seems to be a promising criterion for other 
statistical inference as well. 





4. PERFORMANCE OF THE BRIDGE CRITERION 


In this section, we establish rigorous theory on the consistency and efficiency of the bridge
criterion proposed in (13). We prove its consistency and asymptotic efficiency in Subsections
4.1 and 4.2, respectively. In Subsection 4.3, we propose an extended bridge criterion
and its associated two-step strategy, in order to relax some technical assumptions. In view
of the above intuitive argument, the extended criterion works in the following way. Let the
insect climb on the AIC slope, and record its ending point $\hat L_{\mathrm{AIC}}$; modify the BC increment
$J_{\mathrm{BC}}(L) - J_{\mathrm{BC}}(L - 1)$ from $2 L_{\max}^{(N)}/(N L)$ to $2 M_N/(N L)$, where $M_N$ is slightly smaller than
$L_{\max}^{(N)}$; let the insect move again on the modified BC slope with boundary $\hat L_{\mathrm{AIC}}$. In this way,
the insect can still stop at $L_0$ if it is finite, and otherwise moves faster towards the end $\hat L_{\mathrm{AIC}}$
as if it were on the AIC slope.

4.1 Consistency 

Theorem 2 Suppose that the time series data is generated from a finite order autoregression, 
and that 

T (N) T (AO 

J-'max xjmax ~ 

Iim ----— = OO, Iim - 7- = U . 

AT-*x> log log TV N^oo N 2 

Then the bridge criterion is consistent. In addition, if $\hat L$ is selected from any finite set of
integers that does not depend on $N$ and that contains the true order $L_0$, then $\hat L$ converges
not only in probability but also almost surely to $L_0$.

Remark 1 Theorem 2 proves the consistency of the bridge criterion under mild assumptions. It
is worth mentioning that an analyst does not run the risk of specifying a finite candidate
set $\{1, \dots, L_{\max}\}$ that excludes the true order $L_0$, because any finite true order will
eventually be included as a candidate and evaluated by the bridge criterion as the sample size
becomes large. The proof of Theorem 2 is given in the Supplementary Material.

Remark 2 The proof of Theorem 2 could be adapted in such a way that $\lim_{N\to\infty} L_{\max}^{(N)}/\log\log N = \infty$
is not necessary to prove the consistency. That condition is used for proving the strong consistency of
$\hat{L}$ when a finite candidate set including the true order $L_0$ is specified instead. Nevertheless,
various numerical experiments show that this condition greatly enhances the finite-sample performance of
the bridge criterion when applied to the candidate set $\{1, \ldots, L_{\max}^{(N)}\}$.

4.2 Asymptotic efficiency 

We introduce the following notation. The matrix norm $\|\cdot\|$ is defined by $\|M\| = \sup_{\|y\|_2 = 1} \|My\|_2$,
where $\|\cdot\|_2$ denotes the Euclidean norm of a column vector. For a positive definite matrix
A, the norm $\|\cdot\|_A$ is defined by $\|y\|_A = (y^T A y)^{1/2}$. If two vectors $y_1 = [y_{1,1}, \ldots, y_{1,L_1}]^T$ and
$y_2 = [y_{2,1}, \ldots, y_{2,L_2}]^T$ are of different sizes, then we allow subtraction of those vectors by
modifying the definition in the following way. Given $y_1, y_2$, define $y_1', y_2'$ as vectors of size
$L' = \max\{L_1, L_2\}$ by appending $\max\{L_1, L_2\} - \min\{L_1, L_2\}$ zeros to the tail of $y_1$ or $y_2$. We
define subtraction of $y_1, y_2$ in this case as $y_1' - y_2'$. Similarly, if the size of a vector y is smaller
than that of a positive definite matrix A of size $k \times k$, $\|y\|_A$ is the same as $\|y'\|_A$, where $y'$ is of size
k and is obtained by appending zeros to the tail of y.

We are usually interested in the one-step prediction error if a mismatch filter, as defined
below, is specified (Akaike 1969; Akaike 1970). Assume that the data is generated from a
filter $\Psi_{L_0}$ as in (1). The one-step prediction error of using filter $\Lambda_L$ minus that of using the
true filter is referred to as the mismatch error

$$E\{[x_n, \ldots, x_{n-L'+1}](\Psi_{L_0} - \Lambda_L)\}^2 = \|\Lambda_L - \Psi_{L_0}\|_{\Gamma_{L'}}^2, \tag{15}$$

where $L' = \max\{L_0, L\}$ and $\Gamma_{L'}$ is the $L' \times L'$ covariance matrix of the true autoregression,
namely its (i, j)th element is $\gamma_{i-j}$. The following assumptions are needed for this section.
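
As a minimal illustration of the zero-padding convention and the mismatch error in (15), the sketch below evaluates the quadratic form numerically. The helper names (`pad_to`, `weighted_norm_sq`, `mismatch_error`) and the AR(1) example values are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def pad_to(y, size):
    """Append zeros to the tail of y so that it has the given size."""
    y = np.asarray(y, dtype=float)
    return np.concatenate([y, np.zeros(size - y.size)])

def weighted_norm_sq(y, A):
    """Squared norm ||y||_A^2 = y^T A y, zero-padding y up to the size of A."""
    A = np.asarray(A, dtype=float)
    y = pad_to(y, A.shape[0])
    return float(y @ A @ y)

def mismatch_error(lam, psi, gamma):
    """Mismatch error ||lam - psi||^2 in the norm induced by the Toeplitz
    covariance matrix whose (i, j) entry is gamma[|i - j|].

    gamma must hold at least max(len(lam), len(psi)) autocovariances.
    """
    size = max(len(lam), len(psi))
    idx = np.abs(np.subtract.outer(np.arange(size), np.arange(size)))
    Gamma = np.asarray(gamma, dtype=float)[idx]
    diff = pad_to(lam, size) - pad_to(psi, size)
    return weighted_norm_sq(diff, Gamma)

# Hypothetical example: AR(1) process x_n + 0.9 x_{n-1} = e_n with unit noise
# variance, for which gamma_k = (-0.9)^k / (1 - 0.81).
psi = [0.9]
gamma = [(-0.9) ** k / (1 - 0.81) for k in range(3)]
print(mismatch_error([0.9], psi, gamma))       # the true filter
print(mismatch_error([0.8, 0.1], psi, gamma))  # a mismatched order-2 filter
```

For the order-2 filter the value can be checked by hand: the extra prediction error is the variance of $-0.1x_{n-1} + 0.1x_{n-2}$, which equals $0.038\gamma_0 = 0.2$.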

Assumption 1 The data $\{x_n : n = 1, \ldots, N\}$ is generated from the recursion $x_n + \psi_{\infty,1} x_{n-1} +
\psi_{\infty,2} x_{n-2} + \cdots = \epsilon_n$, where $\psi_{\infty,j} \in \mathbb{R}$, $\sum_{j=1}^{\infty} |\psi_{\infty,j}| < \infty$, the $\epsilon_n$'s are independent and identically
distributed according to $\mathcal{N}(0, \sigma^2)$, and the associated power series $\Psi_\infty(z) = 1 + \psi_{\infty,1} z^{-1} +
\psi_{\infty,2} z^{-2} + \cdots$ converges and is not zero for $|z| \ge 1$.

Assumption 2 $\{L_{\max}^{(N)}\}$ is a sequence of positive integers such that $L_{\max}^{(N)} \to \infty$ and $L_{\max}^{(N)} =
o(N^{1/2})$ as N tends to infinity.

Assumption 3 The order of the autoregressive process (or the size of filter $\Psi_\infty$) is infinite.


Remark 3 Assumption 1 is more general than the assumption made in previous sections.
Under Assumption 1, we have $0 < \gamma_0 = \|\Gamma_1\| \le \|\Gamma_2\| \le \cdots \le \|\Gamma\|$, where $\Gamma = (\gamma_{i-j})_{i,j=1}^{\infty}$ is the
infinite dimensional covariance matrix with norm
$\|\Gamma\| = \sup_{\|y\|_2 = 1} \{\sum_{i=1}^{\infty} (\sum_{j=1}^{\infty} \gamma_{i-j} y_j)^2\}^{1/2}$.

Assumption 3 has been assumed in several technical lemmas in (Shibata 1980) that we
are going to introduce. For those lemmas and the scope of this paper, Assumption 3 can be
generalized to allow for the case where the order of the autoregressive process, denoted by
$L_0(N)$, is finite but depends on N. In other words, the data generating process varies with
N. In that case, the associated power series that appeared in Assumption 1 may be written
as $\Psi_{L_0(N)}(z) = 1 + \psi_{L_0(N),1} z^{-1} + \cdots + \psi_{L_0(N),L_0(N)} z^{-L_0(N)}$, and that assumption is accordingly
replaced by: $\Psi_{L_0(N)}(z)$ is not zero for $|z| \ge 1$ and it converges as N tends to infinity; an
additional requirement is the divergence of $L_N^*$ (introduced below) as N tends to infinity.


In this section, we show that the proposed order selection criterion asymptotically minimizes
the mismatch error under certain conditions. Define the cost function $C_N(L) =
L\sigma^2/N + \|\Psi_L - \Psi_\infty\|_\Gamma^2$. It can be regarded as the expected mismatch error if an estimated
filter of order L is used for prediction. In fact, under Assumptions 1-3, it holds that
(Shibata 1980, Proposition 3.2)

$$\lim_{N\to\infty} \max_{1 \le L \le L_{\max}^{(N)}} \left| \frac{\|\hat{\Psi}_L - \Psi_\infty\|_\Gamma^2}{C_N(L)} - 1 \right| = 0 \quad \text{in probability}. \tag{16}$$


In addition, if we use $\{L_N^*\}$ to denote a sequence of positive integers which achieves the
minimum of $C_N(L)$ for each N, namely $L_N^* = \arg\min_{1 \le L \le L_{\max}^{(N)}} C_N(L)$, then for any random
variable $\hat{L}$ possibly depending on $\{x_n : n = 1, \ldots, N\}$, and for any $\epsilon > 0$, it holds that
$\lim_{N\to\infty} \mathrm{pr}\{\|\hat{\Psi}_{\hat{L}} - \Psi_\infty\|_\Gamma^2 / C_N(L_N^*) \ge 1 - \epsilon\} = 1$ (Shibata 1980, Theorem 3.2). The result
shows that the cost of the estimate $\hat{\Psi}_{\hat{L}}$ is no less than $C_N(L_N^*)$ in probability for any order
selection $\hat{L}$. An order selection $\hat{L}$ is called asymptotically efficient if

$$\lim_{N\to\infty} \frac{\|\hat{\Psi}_{\hat{L}} - \Psi_\infty\|_\Gamma^2}{C_N(L_N^*)} = 1 \quad \text{in probability}. \tag{17}$$

Equality (17) can be equivalently written as $\lim_{N\to\infty} C_N(\hat{L})/C_N(L_N^*) = 1$ in probability in
view of Equality (16). The following result establishes the asymptotic efficiency of the bridge
criterion in two common scenarios, i.e., where the mismatch error $\|\Psi_L - \Psi_\infty\|_\Gamma^2$ decays
algebraically or exponentially in L. The two cases cover a wide range of linear processes, as
we point out in Remark 4. Its proof is given in the Supplementary Material.

Proposition 1 Suppose that Assumptions 1-3 hold.

1. Suppose that the mismatch error $\|\Psi_L - \Psi_\infty\|_\Gamma^2$ satisfies

$$\log \|\Psi_L - \Psi_\infty\|_\Gamma^2 = -\gamma \log L + \log c_L, \tag{18}$$

where $\gamma > 1$ is a constant, and the series $\{c_L : L = 1, 2, \ldots\}$ is lower bounded by a
positive constant and $c_{L+1}/c_L < 1 + \gamma/(L+1)$. If

$$L_{\max}^{(N)} = O\left(N^{\frac{1}{1+\gamma} - \epsilon}\right) \tag{19}$$

holds for a fixed constant $0 < \epsilon < 1/(1+\gamma)$, then the bridge order selection criterion
is asymptotically efficient.

2. Suppose that the mismatch error satisfies the equality

$$\log \|\Psi_L - \Psi_\infty\|_\Gamma^2 = -\gamma L + \log c_L, \tag{20}$$

where $\gamma > 0$ is a constant, and the series $\{c_L : L = 1, 2, \ldots\}$ is lower bounded by a
positive constant and $c_{L+1}/c_L < q$ for some constant $q < \exp(\gamma)$. If

$$L_{\max}^{(N)} \le \frac{1-\epsilon}{\gamma} \log N \tag{21}$$

holds for a fixed constant $0 < \epsilon < 1$, then the bridge order selection criterion is
asymptotically efficient.

Remark 4 To provide an intuition for condition (18), in view of Remark 3 we prove that if
the order of the autoregressive process is not infinity but $L_0(N)$ (which grows with N) instead,
and if $\Psi_{L_0(N)}$ is uniformly distributed in $\mathcal{S}_{L_0(N)}(1)$ for any given N, then for large L ($1 <
L < L_0(N)$)

$$E\{\log \|\Psi_L - \Psi_{L_0(N)}\|_{\Gamma_{L_0(N)}}^2\} = -\log L + \log L_0(N) + o_L(1). \tag{22}$$

The proof is given in the Supplementary Material. Furthermore, it is known that condition
(20) holds (with constant series $c_L$) when the data is generated from a finite order
moving-average process (Shibata 1980).

However, the proposed bridge criterion in (13) is not fully satisfactory in terms of asymptotic
efficiency. For BC to achieve efficiency, Proposition 1 requires $L_{\max}^{(N)}$ to satisfy (19)
or (21), depending on the underlying mismatch error. This poses two concerns. First, the
mismatch error as a function of L is usually unknown in advance, and it can be more complex
than those characterized by (18) and (20). Second, the chosen $L_{\max}^{(N)}$ is not large enough
to incorporate all possible competitive models into the candidate set; this is because $L_{\max}^{(N)}$
is always $\epsilon$-away (in terms of the order) from the minimum of $C_N(L)$ over all positive integers
$L \in \mathbb{N}$. This has motivated us to extend the bridge criterion in such a way that 1) it relaxes
the conditions required by (18) and (20), 2) it selects the optimal order from a broad
candidate set, and 3) it still achieves either consistency in the well-specified case or efficiency
in the mis-specified case.

4.3 Adaptive selection of

To achieve the aforementioned goal, we propose a general strategy that consists of two steps.

1. choose any $L_{\max}^{(N)} = o(\sqrt{N})$ and apply AIC to obtain $L_{\mathrm{AIC}}$;

2. within the range $1, 2, \ldots, L_{\mathrm{AIC}}$, select the optimal order (denoted by $L_{\mathrm{BC}}$) by minimizing
the modified BC penalty

$$\mathrm{BC}(N, L) = \log \hat{e}_L + \frac{2 M_N}{N} \sum_{k=1}^{L} \frac{1}{k}, \tag{23}$$

where $M_N$ is a number to be chosen.
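
The two steps above can be sketched in code as follows. This is a sketch under stated assumptions rather than the authors' implementation: the AR models are fitted by a Levinson-Durbin recursion on sample autocovariances of zero-mean data, AIC and BIC are taken in the standard forms $\log \hat{e}_L + 2L/N$ and $\log \hat{e}_L + L \log N / N$, and the defaults $L_{\max}^{(N)} = \lfloor N^{1/3} \rfloor$ and $M_N = (\log N)^{0.9}$ follow the choices reported in Section 5. All function names are hypothetical.

```python
import numpy as np

def floor_cube_root(N):
    """Largest integer k with k**3 <= N (avoids float round-off at perfect cubes)."""
    k = int(round(N ** (1.0 / 3.0)))
    while k ** 3 > N:
        k -= 1
    while (k + 1) ** 3 <= N:
        k += 1
    return k

def residual_variances(x, L_max):
    """Levinson-Durbin recursion on sample autocovariances (zero-mean data assumed).

    Returns e with e[L] the estimated one-step prediction error variance
    of the fitted AR(L) model, for L = 0, ..., L_max.
    """
    x = np.asarray(x, dtype=float)
    N = x.size
    r = np.array([x[: N - k] @ x[k:] / N for k in range(L_max + 1)])
    e = np.empty(L_max + 1)
    e[0] = r[0]
    a = np.zeros(0)  # predictor coefficients of the current order
    for L in range(1, L_max + 1):
        refl = (r[L] - a @ r[L - 1:0:-1]) / e[L - 1]
        a = np.concatenate([a - refl * a[::-1], [refl]])
        e[L] = e[L - 1] * (1.0 - refl ** 2)
    return e

def two_step_bridge(x, L_max=None, M_N=None):
    """Return (L_BC, L_AIC, L_BIC) using the two-step strategy."""
    N = len(x)
    L_max = floor_cube_root(N) if L_max is None else L_max
    M_N = np.log(N) ** 0.9 if M_N is None else M_N
    orders = np.arange(1, L_max + 1)
    fit = np.log(residual_variances(x, L_max)[1:])
    # Step 1: AIC (and, for reference, BIC) over 1..L_max.
    L_aic = int(orders[np.argmin(fit + 2.0 * orders / N)])
    L_bic = int(orders[np.argmin(fit + np.log(N) * orders / N)])
    # Step 2: modified BC penalty (2*M_N/N) * sum_{k<=L} 1/k over 1..L_AIC.
    penalty = 2.0 * M_N / N * np.cumsum(1.0 / orders)
    L_bc = int(orders[np.argmin((fit + penalty)[:L_aic])])
    return L_bc, L_aic, L_bic

# Illustration on a simulated AR(2) process x_n + 0.8 x_{n-1} + 0.64 x_{n-2} = e_n.
rng = np.random.default_rng(0)
N, burn = 5000, 500
noise = rng.standard_normal(N + burn)
x = np.zeros(N + burn)
for n in range(2, N + burn):
    x[n] = -0.8 * x[n - 1] - 0.64 * x[n - 2] + noise[n]
x = x[burn:]
print(two_step_bridge(x))
```

The harmonic sum in the penalty is what makes the increment between consecutive orders shrink as $2M_N/(NL)$, so small orders are penalized heavily and large orders lightly.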

We note that $M_N = L_{\max}^{(N)}$ was chosen in the previous sections, but it may not be the
ideal choice in our two-stage approach, as we shall see later. We define

$$L_0^{(N)} = \arg\min_{L \in \mathbb{N}} C_N(L) \tag{24}$$

to be the “universally optimal order”. In most cases $L_0^{(N)}$ is upper bounded by $N^{1-\epsilon}$ for a
fixed $\epsilon > 0$. For instance, if $\|\Psi_L - \Psi_\infty\|_\Gamma^2$ follows the algebraic decay $cL^{-\gamma}$ for some $\gamma > 0$,
then $L_0^{(N)} = \Theta(N^{1/(1+\gamma)})$. Nevertheless, it is possible that $L_0^{(N)} > \sqrt{N}$.
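
The growth rate of the universally optimal order under algebraic decay can be checked numerically. The sketch below is an illustration under assumed values ($c = 1$, $\gamma = 2$, $\sigma^2 = 1$, all hypothetical): it minimizes $C_N(L) = L\sigma^2/N + cL^{-\gamma}$ over the integers by grid search and compares the result with the stationary point $(c\gamma N/\sigma^2)^{1/(1+\gamma)}$ of the continuous relaxation.

```python
import numpy as np

def cost(L, N, c=1.0, gam=2.0, sigma2=1.0):
    """Cost C_N(L) = L*sigma2/N + c*L**(-gam) under an assumed algebraic decay."""
    L = np.asarray(L, dtype=float)
    return L * sigma2 / N + c * L ** (-gam)

def optimal_order(N, L_upper, **kw):
    """Integer minimizer of C_N(L) over 1..L_upper, found by grid search."""
    grid = np.arange(1, L_upper + 1)
    return int(grid[np.argmin(cost(grid, N, **kw))])

def continuous_minimizer(N, c=1.0, gam=2.0, sigma2=1.0):
    """Stationary point of the continuous relaxation of C_N(L)."""
    return (c * gam * N / sigma2) ** (1.0 / (1.0 + gam))

for N in (10**3, 10**4, 10**5):
    print(N, optimal_order(N, L_upper=N), round(continuous_minimizer(N), 2))
```

Because the cost is convex in L, the integer minimizer lies within one unit of the continuous one, so both grow like $N^{1/(1+\gamma)}$.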

In the rest of this section, we consider the case $L_0^{(N)} < L_{\max}^{(N)}$ in order to:

1) take into account the most competitive model that does not depend on the choice of $L_{\max}^{(N)}$,
as (24) implies $L_0^{(N)} = \arg\min_{1 \le L \le L_{\max}^{(N)}} C_N(L)$;

2) simplify technical derivations. But we emphasize that this requirement is not essential.

Assumption 4 In the mis-specified scenario, it holds that $L_0^{(N)} < L_{\max}^{(N)}$. In addition, $C_N(L)$
has a well-separated mode in the sense that if $\lim_{N\to\infty} C_N(L_N)/C_N(L_0^{(N)}) = 1$ holds for a
sequence $L_N$, then $\lim_{N\to\infty} L_N/L_0^{(N)} = 1$.

Remark 5 The efficiency of AIC under a mis-specified model implies that
$\lim_{N\to\infty} C_N(L_{\mathrm{AIC}})/C_N(L_0^{(N)}) = 1$ in probability which, given Assumption 4, further implies

$$\lim_{N\to\infty} \frac{L_{\mathrm{AIC}}}{L_0^{(N)}} = 1 \quad \text{in probability}. \tag{25}$$

Assumption 4 is easily satisfied in many common cases. For example, we consider two
common scenarios that were also described in Proposition 1: the mismatch error has an
algebraic decay $\|\Psi_L - \Psi_\infty\|_\Gamma^2 = cL^{-\gamma}$, or an exponential decay $\|\Psi_L - \Psi_\infty\|_\Gamma^2 = c\exp(-\gamma L)$.
We let $\varrho_N = L_N/L_0^{(N)}$. Via straightforward calculation, $\lim_{N\to\infty} C_N(L_N)/C_N(L_0^{(N)}) = 1$ can
be rewritten as $\lim_{N\to\infty} \varrho_N^{-\gamma}(1 + \gamma\varrho_N^{\gamma+1})/(1+\gamma) = 1$ in the case of algebraic decay, and it can
be rewritten as $\lim_{N\to\infty} [\exp\{-\gamma(L_N - L_0^{(N)})\}/(1 + \gamma L_0^{(N)}) + \gamma L_N/(1 + \gamma L_0^{(N)})] = 1$ in the case
of exponential decay. In both cases, it follows that $\lim_{N\to\infty} \varrho_N = 1$ in probability.


The following theorem establishes the consistency and efficiency of the two-stage strategy. 

Theorem 3 Suppose that $L_{\mathrm{AIC}}$ is obtained from the first step of the two-step strategy, and
Assumption 2 holds. Suppose that there exists a sequence $M_N$ (indexed by N) satisfying

$$\lim_{N\to\infty} \frac{M_N}{\log\log N} = \infty. \tag{26}$$

In addition, assume that under a mis-specified model class, Assumptions 1, 3, 4 hold, and for
all sufficiently large N

$$M_N < \frac{q L_0^{(N)}}{\log L_0^{(N)}}, \tag{27}$$

where $0 < q < 1$ is some constant and $L_0^{(N)}$ was defined in (24). Then, using the above
two-stage strategy, the modified bridge criterion in (23) is consistent in the well-specified case
and efficient in the mis-specified case. Moreover, if in the well-specified case $\hat{L}$ is selected
from a finite candidate set that does not depend on N and that contains the true order $L_0$,
then $\hat{L}$ converges almost surely to $L_0$.

Remark 6 We note that Conditions (26) and (27) are fairly weak. For instance, $L_0^{(N)}$ is
respectively $\Theta(N^r)$ ($0 < r < 1$) and $\Theta(\log N)$ in the two cases described in Proposition 1, so
we may choose $M_N = (\log N)^{\tau}$ with any $0 < \tau < 1$.

Remark 7 We provide an intuitive reasoning here. In the well-specified scenario, (26)
guarantees consistency due to Theorem 2. In the mis-specified scenario, (25) and (27) imply
that $M_N < L_{\mathrm{AIC}}/\log L_{\mathrm{AIC}}$. Such $M_N$ produces penalty increments $J_{\mathrm{BC}}(L+1) - J_{\mathrm{BC}}(L)$ that are
lighter than those of AIC for large L (recall that the candidate set in the second step is $1, \ldots, L_{\mathrm{AIC}}$).
In view of that, BC produces an $\hat{L}$ that is close to the boundary $L_{\mathrm{AIC}}$.

Remark 8 Another form of the modified bridge criterion is written as

$$\mathrm{BC}(N, L) = \log \hat{e}_L + \frac{2 M_N}{N} \sum_{k=1}^{L} \frac{1}{k^{\zeta}}, \tag{28}$$

where $\zeta > 0$, $\zeta \neq 1$. By a similar proof, it can be shown that Theorem 3 can be modified to
the case $0 < \zeta < 1$ by requiring the following changes: replace $L_0^{(N)}/\log L_0^{(N)}$ by $(L_0^{(N)})^{\zeta}$ in
(27), and require $q < 1 - \zeta$. In addition, Theorem 3 can be modified to the case $\zeta > 1$ via
replacing $/\log L_0^{(N)}$ by $/a(\zeta)$ in (27), where $a(\zeta) = \sum_{k=1}^{\infty} k^{-\zeta}$. As a possible future
work, it would be interesting to compare the performance of $\zeta = 1$ and $\zeta \neq 1$.


Remark 9 Building upon the proposed bridge criterion, we define the following parametricness
index (PI):

$$\mathrm{PI}_N = \begin{cases} \dfrac{|L_{\mathrm{BC}} - L_{\mathrm{AIC}}|}{|L_{\mathrm{BC}} - L_{\mathrm{AIC}}| + |L_{\mathrm{BC}} - L_{\mathrm{BIC}}|} & \text{if } L_{\mathrm{AIC}} \neq L_{\mathrm{BIC}}, \\ 1 & \text{otherwise.} \end{cases} \tag{29}$$

Following the definition, $\mathrm{PI}_N \in [0, 1]$. Intuitively, $\mathrm{PI}_N$ is close to one in the well-specified
model class where $L_{\mathrm{BC}}$, $L_{\mathrm{BIC}}$ do not differ much, while close to zero in a mis-specified one
where $L_{\mathrm{BC}}$, $L_{\mathrm{AIC}}$ are close and much larger than $L_{\mathrm{BIC}}$. The goal of PI is to measure the extent
to which the specified model class is adequate in explaining the observed data, namely to
assess the confidence that the selected model can be practically treated as the data-generating
model. The larger $\mathrm{PI}_N$, the more confidence. A similar concept has been introduced in (Liu
& Yang 2011) for the goal of estimating the regression function. The following proposition
shows that $\mathrm{PI}_N$ converges in probability to one for the well-specified case. Though we cannot
prove that $\mathrm{PI}_N$ converges in probability to zero for mis-specified cases in general, for
illustration purposes we prove it for some typical mis-specified cases. Experiments on various
synthetic data in Section 5 have shown that $\mathrm{PI}_N$ performs in the way we expected.
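
The parametricness index is a simple function of the three selected orders. A minimal sketch follows, with a hypothetical function name and argument order; it returns one when AIC and BIC agree, in line with the "otherwise" branch of the definition.

```python
def parametricness_index(L_bc, L_aic, L_bic):
    """Parametricness index computed from the orders selected by BC, AIC, and BIC."""
    if L_aic == L_bic:
        return 1.0
    num = abs(L_bc - L_aic)
    # The denominator is positive here: L_bc cannot equal both L_aic and L_bic.
    return num / (num + abs(L_bc - L_bic))

print(parametricness_index(2, 5, 2))  # BC agrees with BIC
print(parametricness_index(5, 5, 2))  # BC agrees with AIC
print(parametricness_index(3, 5, 2))  # BC lies in between
```

When BC coincides with BIC the index equals one, and when BC coincides with AIC (with AIC and BIC differing) it equals zero, matching the intuition described above.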

Proposition 2 Under the same conditions as in Theorem 3, if the model class is well-specified,
$\mathrm{PI}_N$ converges in probability to one as N goes to infinity; if the model class is mis-specified,
and we further assume that $C_N(L) + (\log N - 2)L\sigma^2/N$ achieves its minimum at $L_1^{(N)}$ and
$\lim_{N\to\infty} L_1^{(N)}/L_0^{(N)} = 0$, then $\mathrm{PI}_N$ converges in probability to zero as N goes to infinity. An
example is where the mismatch error satisfies $\|\Psi_L - \Psi_\infty\|_\Gamma^2 = cL^{-\gamma}$, where $\gamma$ and c are positive
constants.


5. NUMERICAL RESULTS 

In this section, we present experimental results to demonstrate the theoretical results and
the advantages of the bridge criterion on both synthetic and real-world datasets. Throughout
the experiments, we use the two-step bridge criterion defined in (23), and we adopt

$$L_{\max}^{(N)} = \lfloor N^{1/3} \rfloor, \qquad M_N = (\log N)^{0.9} \tag{30}$$

due to Theorem 3 and Remark 6, where N is the sample size.

5.1 Synthetic data experiment: consistency in finitely dimensional model 

The purpose of this experiment is to show the consistency of BC and BIC. The performance
of BC, AIC, and BIC in terms of order selection for a well-specified model class is summarized
in Table 2. In Table 2, the data is simulated using autoregressive filters $\Psi_2 = [a, a^2]^T$ for
a = 0.3, −0.3, 0.8, −0.8. For each a, the estimated orders are tabulated for 1000 independent
realizations of AR(2) processes $x_n + a x_{n-1} + a^2 x_{n-2} = \epsilon_n$, $\epsilon_n \sim \mathcal{N}(0, 1)$. The experiment
is repeated for different sample sizes N = 100, 500, 1000, 10000. As expected, the performance
of the bridge criterion lies between those of AIC and BIC, and it is consistent when N
tends to infinity. In addition, the convergence for a = 0.3, −0.3 is slightly slower compared
with a = 0.8, −0.8, because of their smaller signal to noise ratios.





               N = 100          N = 500          N = 1000         N = 10000
   a     L    BC  AIC  BIC     BC  AIC  BIC     BC  AIC  BIC     BC  AIC  BIC
  0.3    1   784  548  851    558  213  661    298   51  405      0    0    0
         2   151  292  135    372  558  333    619  677  589    949  720  999
         3    36   98   13     37  113    5     38  125    5     21   97    1
        >3    29   62    1     33  116    1     45  147    1     30  183    0
 -0.3    1   777  566  845    535  208  628    297   45  375      0    0    0
         2   166  301  145    392  536  365    624  688  617    958  719  997
         3    28   64    8     32  110    6     32  112    7     22  122    3
        >3    29   69    2     41  146    1     47  155    1     20  159    0
  0.8    1     0    0    0      0    0    0      0    0    0      0    0    0
         2   823  749  957    891  734  988    906  715  992    944  726  998
         3   102  148   36     44  125   11     41  118    8     24  102    2
        >3    75  103    7     65  141    1     53  167    0     32  172    0
 -0.8    1     0    0    0      0    0    0      0    0    0      0    0    0
         2   860  783  968    876  738  980    878  709  994    949  703  999
         3    82  127   29     54  112   18     55  133    5     23  115    1
        >3    58   90    3     70  150    2     67  158    1     28  182    0

Table 2: Selected orders for AR(2) processes (1000 realizations for each a and N)


5.2 Synthetic data experiment: efficiency in finitely and infinitely dimensional models

The purpose of this experiment is to show that the proposed order selection criterion achieves
the asymptotic efficiency for both the well-specified and the mis-specified cases. The performance
of BC in terms of mismatch error is compared with those of AIC and BIC in Table 3.
Recall that the mismatch error defined in (15) is the expected one-step ahead prediction
error minus the variance of noise, when an estimated filter is applied to an independent and
identically generated dataset. We consider three different data generating processes below.
In Table 3, for each case and sample size N = 100, 500, 1000, 10000, the tabulated mismatch
error produced by each criterion is the mean of 1000 repeated independent experiments.
The mean parametricness index defined in Remark 9 (denoted by $\mathrm{PI}_N$) in each case is also
tabulated.

Case 1: The first case is AR(1) with $\Psi_1 = [0.9]$, namely $x_n + 0.9x_{n-1} = \epsilon_n$, $\epsilon_n \sim \mathcal{N}(0, 1)$.
This is a well-specified model. As we can see, once the true order is selected with probability
close to one, the resulting predictive performance is also asymptotically optimal.

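Here, we briefly explain how to calculate the exact mismatch error in (15) for any estimated
filter of size L that may or may not equal $L_0$. It suffices to express the covariance
matrix $\Gamma_{L'}$ or its elements $\gamma_0, \ldots, \gamma_{L'-1}$ in terms of the known $\Psi_{L_0}$, where $L' = \max\{L_0, L\}$.
We define the correlation vector and matrix by $\rho_{L_0} = [\gamma_1/\gamma_0, \ldots, \gamma_{L_0}/\gamma_0]^T$, $P_{L_0} = \Gamma_{L_0}/\gamma_0$. By
rewriting the Yule-Walker equation $P_{L_0}\Psi_{L_0} = -\rho_{L_0}$, we obtain $(I + \Phi_{L_0})\rho_{L_0} = -\Psi_{L_0}$, where

$$
\Phi_{L_0} =
\begin{pmatrix}
\psi_{L_0,2} & \psi_{L_0,3} & \cdots & \psi_{L_0,L_0-1} & \psi_{L_0,L_0} & 0 \\
\psi_{L_0,3} & \psi_{L_0,4} & \cdots & \psi_{L_0,L_0} & 0 & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots \\
\psi_{L_0,L_0} & 0 & \cdots & 0 & 0 & 0 \\
0 & 0 & \cdots & 0 & 0 & 0
\end{pmatrix}
+
\begin{pmatrix}
0 & 0 & \cdots & 0 & 0 & 0 \\
\psi_{L_0,1} & 0 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots \\
\psi_{L_0,L_0-2} & \psi_{L_0,L_0-3} & \cdots & \psi_{L_0,1} & 0 & 0 \\
\psi_{L_0,L_0-1} & \psi_{L_0,L_0-2} & \cdots & \psi_{L_0,2} & \psi_{L_0,1} & 0
\end{pmatrix}.
$$

We thus obtain $\rho_{L_0} = -(I + \Phi_{L_0})^{-1}\Psi_{L_0}$, $\gamma_0 = \sigma^2/(1 + \rho_{L_0}^T\Psi_{L_0})$, and $\gamma_\ell = \gamma_0\rho_{L_0,\ell}$ ($\ell =
1, \ldots, L_0$). Furthermore, for each $\ell > L_0$, $\gamma_\ell$ equals $-\sum_{k=1}^{L_0} \psi_{L_0,k}\gamma_{\ell-k}$.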
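
The computation just described can be sketched as follows. The function name `ar_autocovariances` is hypothetical, and the matrix entries are filled from the index rule obtained by rewriting the Yule-Walker equations: the (i, m) entry collects $\psi_{L_0,i+m}$ when $i + m \le L_0$ and $\psi_{L_0,i-m}$ when $i - m \ge 1$. This is an illustrative sketch, not the authors' code.

```python
import numpy as np

def ar_autocovariances(psi, n_lags, sigma2=1.0):
    """Autocovariances gamma_0, ..., gamma_{n_lags-1} of the autoregression
    x_n + psi[0] x_{n-1} + ... + psi[L0-1] x_{n-L0} = e_n with noise variance sigma2.
    """
    psi = np.asarray(psi, dtype=float)
    L0 = psi.size
    Phi = np.zeros((L0, L0))
    for i in range(1, L0 + 1):
        for m in range(1, L0 + 1):
            if i + m <= L0:
                Phi[i - 1, m - 1] += psi[i + m - 1]  # psi_{i+m}
            if i - m >= 1:
                Phi[i - 1, m - 1] += psi[i - m - 1]  # psi_{i-m}
    # Solve (I + Phi) rho = -psi for the correlations rho_1, ..., rho_{L0}.
    rho = -np.linalg.solve(np.eye(L0) + Phi, psi)
    gamma0 = sigma2 / (1.0 + rho @ psi)
    gamma = [gamma0] + list(gamma0 * rho)
    # Extend beyond lag L0 with the autoregressive recursion.
    for ell in range(L0 + 1, n_lags):
        gamma.append(-sum(psi[k - 1] * gamma[ell - k] for k in range(1, L0 + 1)))
    return np.array(gamma[:n_lags])

# AR(1) filter of Case 1: gamma_k = (-0.9)^k / (1 - 0.81).
print(ar_autocovariances([0.9], 3))
```

For an AR(2) filter the linear system reduces to $(1 + \psi_{2,2})\rho_1 = -\psi_{2,1}$ and $\psi_{2,1}\rho_1 + \rho_2 = -\psi_{2,2}$, which is the familiar closed form for the first two autocorrelations.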

Case 2: The second case is AR($L_0(N)$) with $L_0(N) = \lfloor N^{0.4} \rfloor$ and $\Psi_{L_0(N)} = [0.7^k]_{k=1}^{L_0(N)}$,
namely $x_n + \psi_{L_0(N),1}x_{n-1} + \cdots + \psi_{L_0(N),L_0(N)}x_{n-L_0(N)} = \epsilon_n$, $\epsilon_n \sim \mathcal{N}(0, 1)$. This is the case
where the true order is large relative to the sample size, and thus it can be treated as the
infinite dimensional model (see Remark 3). Note that all the roots of each characteristic
polynomial have modulus 0.7. For the sample sizes N = 100, 500, 1000, 10000, the true
orders that generated the autoregression are 6, 12, 15, 39, respectively.

Case 3: The third case is the first order moving average process $x_n = \epsilon_n - 0.8\epsilon_{n-1}$, $\epsilon_n \sim
\mathcal{N}(0, 1)$. It is an autoregression with infinite order. The exact mismatch error of an estimated
filter $\Lambda_L$ could be calculated in the following way: $\|\Lambda_L - \Psi_\infty\|_\Gamma^2 = E\{x_{n+1} +
[x_n, \ldots, x_{n-L+1}]\Lambda_L\}^2 - \sigma^2 = 1.64(1 + \|\Lambda_L\|_2^2) - 2 \cdot 0.8(\lambda_{L,1} + \sum_{k=1}^{L-1} \lambda_{L,k}\lambda_{L,k+1}) - 1$, where
we have used $E(x_n^2) = 1.64$, $E(x_n x_{n-1}) = -0.8$, and $E(x_n x_{n-k}) = 0$ for $k > 1$.
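
The closed-form expression for Case 3 can be cross-checked against a direct evaluation of the quadratic form. In the sketch below (function names are hypothetical), the second function builds the covariance matrix of $[x_{n+1}, x_n, \ldots, x_{n-L+1}]$ from the MA(1) autocovariances 1.64 and −0.8 and subtracts the noise variance.

```python
import numpy as np

def ma1_mismatch_closed_form(lam):
    """Closed-form mismatch error for x_n = e_n - 0.8 e_{n-1} and a filter lam."""
    lam = np.asarray(lam, dtype=float)
    cross = lam[0] + lam[:-1] @ lam[1:]
    return 1.64 * (1.0 + lam @ lam) - 2 * 0.8 * cross - 1.0

def ma1_mismatch_quadratic(lam):
    """Same quantity as E{x_{n+1} + [x_n, ..., x_{n-L+1}] lam}^2 - 1."""
    lam = np.asarray(lam, dtype=float)
    v = np.concatenate([[1.0], lam])  # weights on x_{n+1}, x_n, ..., x_{n-L+1}
    size = v.size
    gamma = np.zeros(size)
    gamma[0], gamma[1] = 1.64, -0.8   # autocovariances vanish beyond lag 1
    idx = np.abs(np.subtract.outer(np.arange(size), np.arange(size)))
    return float(v @ gamma[idx] @ v) - 1.0

lam = [0.8, 0.5, 0.2]
print(ma1_mismatch_closed_form(lam), ma1_mismatch_quadratic(lam))
```

For the single-coefficient filter [0.8] both routes give $0.64^2 = 0.4096$, since $x_{n+1} + 0.8x_n = \epsilon_{n+1} - 0.64\epsilon_{n-1}$.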

In summary, Tables 2 and 3 show that BC achieves the performance that we had expected:
it is consistent when the model class is well-specified, and its predictive performance is
close to the better of AIC and BIC in both well-specified and mis-specified cases. In
practice, when no prior knowledge about the model specification is available, the proposed
method is more flexible and reliable than AIC and BIC in selecting the most appropriate
dimension.


                       N = 100                                  N = 500
 Case     BC       AIC      BIC      PI_N          BC       AIC      BIC      PI_N
  1      19.7     28.6     16.6     0.96           2.9      5.7      2.4     0.97
        (1.13)   (1.28)   (1.01)   (0.0061)       (0.18)   (0.26)   (0.13)   (0.0050)
  2      76.7     71.9     94.2     0.58          17.6     17.5     25.2     0.29
        (1.24)   (1.08)   (1.33)   (0.016)        (0.25)   (0.24)   (0.33)   (0.014)
  3      97.8     94.7    122.8     0.58          26.6     26.6     38.0     0.32
        (1.28)   (1.12)   (1.55)   (0.016)        (0.27)   (0.27)   (0.41)   (0.015)

                       N = 1000                                 N = 10000
 Case     BC       AIC      BIC      PI_N          BC       AIC      BIC      PI_N
  1       1.6      3.4      1.3     0.98           0.11     0.39     0.10    0.99
        (0.11)   (0.15)   (0.065)  (0.0047)       (0.012)  (0.020)  (0.0049) (0.0033)
  2       9.9      9.9     14.6     0.18           1.4      1.4      2.1     0.11
        (0.13)   (0.13)   (0.18)   (0.012)        (0.019)  (0.019)  (0.025)  (0.0097)
  3      14.6     14.6     22.1     0.21           2.02     2.02     3.19    0.032
        (0.15)   (0.15)   (0.24)   (0.013)        (0.021)  (0.021)  (0.032)  (0.0056)

Table 3: Mismatch errors (and their standard errors, in parentheses) of autoregressive models
selected by BC, AIC, and BIC, along with the parametricness index, in three different cases
(values except $\mathrm{PI}_N$ and its standard errors were rescaled by $10^3$)





5.3 Real data experiment: the El Nino data from 1935 to 2015 

As the largest climate pattern, El Nino serves as the most dominant factor of oceanic influence
on climate. The NINO3 index, defined as the area averaged sea surface temperature from
5°S-5°N and 150°W-90°W, is calculated from HadISST1 within the range of January 1935
to May 2015 (Rayner, Parker, Horton, Folland, Alexander, Rowell, Kent & Kaplan 2003).
The monthly data with overall 965 points is shown in Fig. 4(a). The sample partial
autocorrelations shown in Fig. 4(b) suggest that the data is highly dependent.

To evaluate the predictive power of BC, AIC, and BIC, ideally we would apply each
estimated filter to independent and identically generated datasets, as we have done in the
synthetic data experiments. But this form of cross-validation is not realistic for a single
real-world time series. As an alternative, we adopt a prequential perspective (Dawid
1984; Ing & Wei 2005), and evaluate the criteria in terms of the one-step prediction errors
conditioning only on the past data at each time. Specifically, we start from an initial time
step, say $N_0 = 200$, and obtain an estimated AR filter from the first 200 points
under each criterion C. Upon the arrival of the $(n = N_0 + 1)$th point, the one-step prediction
error is revealed to be $e_n(C) = (x_n - \hat{x}_n(C))^2$, where $\hat{x}_n(C)$ denotes the one-step prediction
of $x_n$ computed from $x_{n-1}, \ldots, x_{n-\hat{L}}$ with the estimated filter. This procedure is repeated for
$n = N_0 + 2, \ldots, N = 965$, each time the AR filter being estimated from the observed $n - 1$
data points and the tuning parameters being $L_{\max}^{(n)} = \lfloor n^{1/3} \rfloor$, $M_n = (\log n)^{0.9}$ (note that the N
in (30) was replaced by the available sample size n). The cumulated average prediction error
at each n is computed to be $\bar{e}_n(C) = \sum_{t=N_0+1}^{n} e_t(C)/(n - N_0)$. To highlight the differences
of $\bar{e}_n(C)$ for C = BC, AIC, BIC, we have plotted the normalized curve $\bar{e}_n(C) - \bar{e}_n(\mathrm{opt})$ in
Fig. 4(c), where $\bar{e}_n(\mathrm{opt}) = \min\{\bar{e}_n(\mathrm{AIC}), \bar{e}_n(\mathrm{BIC})\}$ for each $n = N_0 + 1, \ldots, N$. In order
to show predictive power that may vary at different time epochs, we have also plotted in
Fig. 4(d) the (normalized) average prediction errors over only a sliding window of fixed size
100, namely $\tilde{e}_n(C) = \sum_{t=s+1}^{n} e_t(C)/(n - s)$ where $s = \max\{N_0, n - 100\}$.
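
The two averaging schemes can be sketched as follows, assuming the one-step squared prediction errors of a criterion have already been collected in an array whose first entry corresponds to time $N_0 + 1$. The function names and the toy error sequence are hypothetical.

```python
import numpy as np

def cumulated_average(errors):
    """Running mean of the prediction errors: entry i is the average over
    times N0+1, ..., N0+1+i."""
    errors = np.asarray(errors, dtype=float)
    return np.cumsum(errors) / np.arange(1, errors.size + 1)

def windowed_average(errors, N0, window=100):
    """Average over t = s+1, ..., n with s = max(N0, n - window), for each n."""
    errors = np.asarray(errors, dtype=float)
    csum = np.concatenate([[0.0], np.cumsum(errors)])
    out = np.empty(errors.size)
    for i in range(errors.size):
        n = N0 + 1 + i
        s = max(N0, n - window)
        # errors[t - N0 - 1] holds e_t, so the sum over t = s+1..n is a csum difference.
        out[i] = (csum[n - N0] - csum[s - N0]) / (n - s)
    return out

# Toy error sequence for illustration only (not real data).
errors = np.arange(1.0, 151.0)
print(cumulated_average(errors)[-1], windowed_average(errors, N0=200)[-1])
```

Normalized curves are then obtained by subtracting, at each n, the smaller of the AIC and BIC averages from the average of each criterion.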

In addition, in order to capture potential dynamics during different time epochs, we have
also considered the estimation from a sliding window of fixed size $N_0$. Specifically, we start
from the same initial time step $N_0 = 200$, and for each $n = N_0 + 1, \ldots, N$, the AR filters are
estimated from only $x_{n-N_0}, \ldots, x_{n-1}$ with $L_{\max}^{(N_0)} = \lfloor N_0^{1/3} \rfloor$, $M_{N_0} = (\log N_0)^{0.9}$ (note that the
N in (30) was replaced by the available sample size $N_0$). Similarly, we computed the one-step
prediction errors, the normalized cumulated average prediction errors (plotted in Fig. 4(e)),
and the normalized windowed average prediction errors (plotted in Fig. 4(f)). Fig. 4(c)-(f)
show that the performance of BC is close to that of AIC and that BC outperforms BIC in general.



Figure 4: (a) The monthly NINO3 index from January 1935 to May 2015; (b) the sample
partial autocorrelations of the complete data with 95% confidence bounds; (c) the normalized
cumulated average prediction error at each time step (using all the current observations);
(d) the normalized average prediction error over the recent window of size 100 (using all
the current observations); (e) the normalized cumulated average prediction error (using the
recent $N_0$ observations); (f) the normalized average prediction error over the recent window
of size 100 (using the recent $N_0$ observations). In subfigures (c)-(f), BC, AIC, and BIC
are respectively marked in red, blue, and black, and the curves have been normalized by
subtracting the minimum of the AIC curve and the BIC curve.

5.4 Real data experiment: the English temperature data from 1659 to 2014

In this experiment, we study the monthly English temperature data from 1659 to 2014
used by Dieppois, Durand, Fournier & Massei (2013), which is perhaps the longest recorded
environmental data in human history. We have pre-processed the raw data by subtracting
from each monthly value the average of that month over the 356 years. The de-seasoned data (with
overall N = 4272 points) is plotted in Fig. 5(a). Its sample partial autocorrelations are
shown in Fig. 5(b). In order to capture potential dynamics during such a long period, we
adopt the prequential approach that was used to draw Fig. 4(f), and omit the counterparts
of Fig. 4(c)-(e). Specifically, we started from $N_0 = 500$, and for each $n = N_0 + 1, \ldots, N$
the one-step ahead prediction was made by an AR filter produced from the recent window
of $N_0$ observations. The prediction errors $e_n$ were averaged over a fixed window of size 100,
namely $\tilde{e}_n(C) = \sum_{t=s+1}^{n} e_t(C)/(n - s)$ where $s = \max\{N_0, n - 100\}$. We have plotted in
Fig. 5(c) the normalized average prediction errors, $\tilde{e}_n(C) - \tilde{e}_n(\mathrm{opt})$, where $\tilde{e}_n(\mathrm{opt}) =
\min\{\tilde{e}_n(\mathrm{AIC}), \tilde{e}_n(\mathrm{BIC})\}$ (as before). We highlight the normalized average prediction
errors within the range $n = N_0 + 500, \ldots, N_0 + 1500$ in Fig. 5(d). In this experiment, AIC is
not uniformly superior to BIC, and BC adaptively stays close to the better of AIC
and BIC. Furthermore, BC achieves the best predictive performance in some regions. The
results suggest that BC is more flexible and reliable than AIC and BIC in practical applications.
Note that we have adopted a specific choice of $L_{\max}^{(N)}$ and $M_N$ (see (30)) throughout all the
synthetic and real-world data experiments. In practice, an analyst may achieve better
predictive performance of BC by fine-tuning $L_{\max}^{(N)}$ and $M_N$ for a particular real dataset.

Figure 5: (a) The de-seasoned data; (b) the sample partial autocorrelations of the complete
data with 95% confidence bounds; (c) the normalized cumulated average prediction error
at each time step (using the recent $N_0$ observations); (d) the normalized average prediction
error over the recent window of size 100 (using the recent $N_0$ observations). In subfigures
(c)-(d), BC, AIC, and BIC are respectively marked in red, blue, and black, and the curves
have been normalized by subtracting the minimum of the AIC curve and the BIC curve.

6. DISCUSSION

There have been many debates on which of AIC and BIC should be used (Burnham &
Anderson 2004). A practitioner who supports AIC may argue that all models are wrong,
and thus it is safe to choose AIC, which generally performs better in mis-specified situations. In
contrast, a practitioner who supports BIC is usually in favor of the mathematically appealing
“consistency” property and is quite confident that the candidate set of models contains the
true (or practically a very good) model, or simply has a strong preference for parsimony in
modeling. However, the debate arises from an underlying assumption that tends to
be overlooked: a practitioner should choose either AIC or BIC before even looking at the
observed data. If some model specification test were done, the practitioner might have changed
his/her prejudice. In a certain sense, the bent curve of the bridge criterion, different from straight
lines, was designed to mimic a sequence of model specification tests which continuously check
“whether there exists a finite dimension $L_0$ underlying the observed data”. For practical
situations where there is no prior information, the bridge criterion provides a practitioner with
opportunities to change or reinforce his/her belief in the model specification.

As a possible future work, it would be interesting to see to what extent the bridge criterion
can be extended to other model selection problems, for instance the vector autoregressive
model, autoregressive-moving-average model, and generalized linear model.